Model Selection

English Visual Question Answering

# English Visual Question Answering

Qwen2.5-VL-7B-Instruct is a multimodal model based on the Qwen2.5 architecture, supporting joint processing of images and text, suitable for vision-language tasks.

Safetensors English

Gemma 3 27b It Abliterated Mlx Vlm 4Bit

This model is a multimodal model in MLX format converted from huihui-ai/gemma-3-27b-it-abliterated, supporting the processing from image and text to text.

Transformers English

Open-Qwen2VL is a multimodal model capable of receiving both images and text as input and generating text output.

Image-to-Text English

Smolvlm2 256M Video Instruct Mlx

This is a video-text-to-text model converted based on the MLX framework, suitable for video understanding and instruction-following tasks.

Transformers English

Qwen2 VL 7B Instruct GGUF

Qwen2-VL-7B-Instruct is a 7B-parameter multimodal model supporting image-text interaction tasks.

Image-to-Text English

Pix2struct Vizwizvqa Base

This is a visual question answering model based on the Apache-2.0 license, supporting the English language, and focusing on handling vision-related question answering tasks.

Transformers English

Featured Recommended AI Models

AIbase

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご

© 2025AIbase